ABSTRACT

This paper addresses the pattern classification problem arising when the available target data include some uncertainty information. The target data considered here are either qualitative (a class label) or quantitative (an estimate of the posterior probability). Our main contribution is an SVM-inspired formulation of this problem that takes class labels into account through a hinge loss and probability estimates through an ε-insensitive cost function, together with a minimum norm (maximum margin) objective. This formulation has a dual form leading to a quadratic problem and allows the use of a representer theorem and an associated kernel. The solution provided can be used for both decision and posterior probability estimation. Based on empirical evidence, our method outperforms regular SVM in terms of probability predictions and classification performance.

Index Terms — support vector machines, maximal mar- 
gin algorithm, uncertain labels. 

1. INTRODUCTION 

In the mainstream supervised classification scheme, an expert is required to label a set of data that is then used as input for training the classifier. However, even for an expert, this labelling task is likely to be difficult in many applications. In the end, the training data set may contain inaccurate classes for some examples, which leads to non-robust classifiers [1]. For instance, this is often the case in medical imaging, where radiologists have to outline what they think are malignant tissues on medical images without access to the reference histopathologic information. We propose to deal with these uncertainties by introducing probabilistic labels in the learning stage so as to: 1. stick to the real-life annotation problem, 2. avoid discarding uncertain data, 3. balance the influence of uncertain data in the classification process.

Our study focuses on the widely used Support Vector Machines (SVM) two-class classification problem [2]. This method aims at finding the separating hyperplane maximizing the margin between the examples of both classes. Several mappings from SVM scores to class membership probabilities have been proposed in the literature [3], [4]. In our approach, we propose to use both labels and probabilities as input, thus learning simultaneously a classifier and a probabilistic output. Note that the output of our classifier may be transformed to probability estimates without using any mapping algorithm.

In Section 2 we define our new SVM problem formulation (referred to as P-SVM) to deal with certain and probabilistic labels simultaneously. Section 3 describes the whole framework of P-SVM and presents the associated quadratic problem. Finally, in Section 5 we compare its performance to the classical SVM formulation (C-SVM) over different data sets to demonstrate its potential.

2. PROBLEM FORMULATION 

We present below a new formulation for the two-class classification problem dealing with uncertain labels. Let $\mathcal{X}$ be a feature space. We define $(x_i, l_i)_{i=1 \dots m}$ the learning dataset of input vectors $(x_i)_{i=1 \dots m} \in \mathcal{X}$ along with their corresponding labels $(l_i)_{i=1 \dots m}$, the latter being

• class labels: $l_i = y_i \in \{-1, +1\}$ for $i = 1 \dots n$ (in classification),

• real values: $l_i = p_i \in [0, 1]$ for $i = n+1 \dots m$ (in regression).

The value $p_i$, associated with point $x_i$, makes it possible to account for uncertainty about the class of $x_i$. We define it as the posterior probability for class 1:

$$p_i = p(x_i) = P(Y_i = 1 \mid X_i = x_i).$$

We define the associated pattern recognition problem as

$$\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \tag{1}$$

subject to

$$y_i (w^\top x_i + b) \ge 1, \quad i = 1 \dots n$$
$$z_i^- \le w^\top x_i + b \le z_i^+, \quad i = n+1 \dots m$$

where the boundaries $z_i^-$, $z_i^+$ directly depend on $p_i$. This formulation consists in minimizing the complexity of the model while forcing good classification and good probability estimation (close to $p_i$). Obviously, if $n = m$, we are brought back to the classical SVM problem formulation.

Following the idea of soft margin introduced in regular SVM to deal with the case of inseparable data, we introduce slack variables $\xi_i$. They measure the degree of misclassification of the datum $x_i$, thus relaxing the hard constraints of the initial optimization problem, which becomes

$$\min_{w,\, b,\, \xi,\, \xi^-,\, \xi^+} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \tilde{C} \sum_{i=n+1}^{m} (\xi_i^- + \xi_i^+) \tag{2}$$

subject to

$$y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1 \dots n$$
$$z_i^- - \xi_i^- \le w^\top x_i + b \le z_i^+ + \xi_i^+, \quad i = n+1 \dots m$$
$$0 \le \xi_i, \quad i = 1 \dots n$$
$$0 \le \xi_i^- \ \text{and} \ 0 \le \xi_i^+, \quad i = n+1 \dots m$$

Parameters $C$ and $\tilde{C}$ are predefined positive real numbers controlling the relative weighting of classification and regression performance.
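As an illustration of the soft-margin objective (2), the following minimal sketch evaluates it for a given linear model. It assumes that each slack variable takes its smallest feasible value, so that the classification term reduces to a hinge loss and the regression term to the distance outside the interval $[z_i^-, z_i^+]$. The function name `psvm_objective`, its argument names, and the default values are hypothetical and chosen only for this example.

```python
def psvm_objective(w, b, X_cls, y, X_reg, z_minus, z_plus, C=100.0, C_tilde=100.0):
    """Value of the soft-margin objective at (w, b), with slacks at their minimum.

    X_cls, y        : labelled points and their labels in {-1, +1}
    X_reg           : points carrying a probability estimate
    z_minus, z_plus : lower and upper score bounds for each point of X_reg
    """
    def score(x):
        # Linear score w.x + b
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    margin_term = 0.5 * sum(wj * wj for wj in w)
    # Smallest slack satisfying y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0
    hinge = sum(max(0.0, 1.0 - yi * score(xi)) for xi, yi in zip(X_cls, y))
    # Smallest slacks satisfying z^- - xi^- <= w.x_i + b <= z^+ + xi^+
    tube = sum(
        max(0.0, lo - score(xi)) + max(0.0, score(xi) - hi)
        for xi, lo, hi in zip(X_reg, z_minus, z_plus)
    )
    return margin_term + C * hinge + C_tilde * tube


value = psvm_objective(
    w=[1.0], b=0.0,
    X_cls=[[2.0], [-0.5]], y=[1, -1],
    X_reg=[[0.0]], z_minus=[-0.2], z_plus=[0.2],
)
print(value)
```

Setting `X_reg` to an empty list leaves only the margin and hinge terms, which mirrors the remark that the formulation reduces to the classical SVM when $n = m$.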

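Let $\varepsilon$ be the labelling precision and $\delta$ the confidence we have in the labelling, and define $\eta = \varepsilon + \delta$. Then the regression problem consists in finding optimal parameters $w$ and $b$ such that

$$\left| \frac{1}{1 + e^{-a(w^\top x_i + b)}} - p_i \right| \le \eta,$$

thus constraining the probability prediction $\frac{1}{1 + e^{-a(w^\top x_i + b)}}$ for point $x_i$ to remain within distance $\eta$ of $p_i$.

The boundaries (where $w^\top x_i + b = \pm 1$) define parameter $a$ as:

$$a = \ln\left(\frac{1}{\eta} - 1\right).$$

Finally:

$$\max(0,\, p_i - \eta) \le \frac{1}{1 + e^{-a(w^\top x_i + b)}} \le \min(p_i + \eta,\, 1)$$
$$\iff z_i^- \le w^\top x_i + b \le z_i^+$$

where $z_i^- = -\frac{1}{a} \ln\left(\frac{1}{p_i - \eta} - 1\right)$ and $z_i^+ = -\frac{1}{a} \ln\left(\frac{1}{p_i + \eta} - 1\right)$.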
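The mapping from a probability label to score bounds can be sketched as follows. The function names are hypothetical. The handling of the clipped cases is our assumption: when $p_i - \eta \le 0$ or $p_i + \eta \ge 1$ the corresponding bound is taken to be infinite, since the sigmoid never reaches 0 or 1. The sketch also assumes $0 < \eta < 0.5$ so that $a > 0$.

```python
import math


def sigmoid_slope(eta):
    """Slope a such that the sigmoid equals 1 - eta on the margin w.x + b = +1."""
    if not 0.0 < eta < 0.5:
        raise ValueError("eta is assumed to lie in (0, 0.5)")
    return math.log(1.0 / eta - 1.0)


def score_bounds(p, eta):
    """Return (z_minus, z_plus) for a probability label p and tolerance eta."""
    a = sigmoid_slope(eta)
    lo_p = max(0.0, p - eta)
    hi_p = min(1.0, p + eta)
    # Assumption: a probability bound of 0 or 1 leaves the score unconstrained
    z_minus = -math.inf if lo_p <= 0.0 else -math.log(1.0 / lo_p - 1.0) / a
    z_plus = math.inf if hi_p >= 1.0 else -math.log(1.0 / hi_p - 1.0) / a
    return z_minus, z_plus


z_minus, z_plus = score_bounds(p=0.5, eta=0.1)
print(z_minus, z_plus)  # symmetric around 0 for p = 0.5
```

For $p_i = 0.5$ the two bounds are opposite, so the admissible scores form a band centred on the decision boundary whose width grows with $\eta$.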

3. DUAL FORMULATION 

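We can rewrite the problem in its dual form by introducing Lagrange multipliers. We are looking for a stationary point of the Lagrange function $L$ defined as

$$\begin{aligned}
L(w, b, \xi, \xi^-, \xi^+, \alpha, \beta, \mu^+, \mu^-, \gamma^+, \gamma^-) ={}& \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \tilde{C} \sum_{i=n+1}^{m} (\xi_i^- + \xi_i^+) \\
&- \sum_{i=1}^{n} \alpha_i \left( y_i (w^\top x_i + b) - (1 - \xi_i) \right) - \sum_{i=1}^{n} \beta_i \xi_i \\
&- \sum_{i=n+1}^{m} \mu_i^- \left( (w^\top x_i + b) - (z_i^- - \xi_i^-) \right) - \sum_{i=n+1}^{m} \gamma_i^- \xi_i^- \\
&- \sum_{i=n+1}^{m} \mu_i^+ \left( (z_i^+ + \xi_i^+) - (w^\top x_i + b) \right) - \sum_{i=n+1}^{m} \gamma_i^+ \xi_i^+
\end{aligned}$$

with $\alpha \ge 0$, $\beta \ge 0$, $\mu^+ \ge 0$, $\mu^- \ge 0$, $\gamma^+ \ge 0$ and $\gamma^- \ge 0$.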
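Computing the derivatives of $L$ with respect to $w$, $b$, $\xi$, $\xi^-$ and $\xi^+$ leads to the following optimality conditions:

$$0 \le \alpha_i \le C, \quad i = 1 \dots n$$
$$0 \le \mu_i^+ \le \tilde{C}, \quad i = n+1 \dots m$$
$$0 \le \mu_i^- \le \tilde{C}, \quad i = n+1 \dots m$$
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i - \sum_{i=n+1}^{m} (\mu_i^+ - \mu_i^-)\, x_i$$
$$\sum_{i=1}^{n} \alpha_i y_i = \sum_{i=n+1}^{m} (\mu_i^+ - \mu_i^-)$$

We also denote by $e_1 = [\,\underbrace{1 \dots 1}_{n \text{ times}} \ \underbrace{0 \dots 0}_{(m-n) \text{ times}}\,]^\top$ and $e_2 = [\,\underbrace{0 \dots 0}_{n \text{ times}} \ \underbrace{1 \dots 1}_{(m-n) \text{ times}}\,]^\top$ the indicator vectors of the classification and regression examples.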
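As a worked sketch of where the box constraints and the equality constraint come from, the stationarity conditions can be written out term by term. This is our own derivation from the Lagrangian of problem (2), with $\beta_i$, $\gamma_i^-$ and $\gamma_i^+$ denoting the multipliers of the positivity constraints on $\xi_i$, $\xi_i^-$ and $\xi_i^+$, and $\tilde{C}$ the regression weight.

```latex
\begin{align*}
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; C - \alpha_i - \beta_i = 0, \ \beta_i \ge 0
  \;\Rightarrow\; 0 \le \alpha_i \le C, \\
\frac{\partial L}{\partial \xi_i^-} = 0
  &\;\Rightarrow\; \tilde{C} - \mu_i^- - \gamma_i^- = 0, \ \gamma_i^- \ge 0
  \;\Rightarrow\; 0 \le \mu_i^- \le \tilde{C}, \\
\frac{\partial L}{\partial \xi_i^+} = 0
  &\;\Rightarrow\; \tilde{C} - \mu_i^+ - \gamma_i^+ = 0, \ \gamma_i^+ \ge 0
  \;\Rightarrow\; 0 \le \mu_i^+ \le \tilde{C}, \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = \sum_{i=n+1}^{m} (\mu_i^+ - \mu_i^-), \\
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i - \sum_{i=n+1}^{m} (\mu_i^+ - \mu_i^-)\, x_i .
\end{align*}
```

Substituting these conditions back into $L$ removes $b$ and all slack variables, which is what makes the remaining expression a function of the multipliers only.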

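Simplifying then leads to

$$L(w, b, \xi, \xi^-, \xi^+, \alpha, \beta, \mu^+, \mu^-, \gamma^+, \gamma^-) = -\frac{1}{2} w^\top w + \sum_{i=1}^{n} \alpha_i - \sum_{i=n+1}^{m} \mu_i^+ z_i^+ + \sum_{i=n+1}^{m} \mu_i^- z_i^-.$$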



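Finally, let $\Gamma = [\alpha_1 \dots \alpha_n \ \ \mu_{n+1}^+ \dots \mu_m^+ \ \ \mu_{n+1}^- \dots \mu_m^-]^\top$ be a vector of dimension $2m - n$. Then

$$w^\top w = \Gamma^\top G\, \Gamma$$

where

$$G = \begin{pmatrix} K_1 & -K_2 & K_2 \\ -K_2^\top & K_3 & -K_3 \\ K_2^\top & -K_3 & K_3 \end{pmatrix}$$

with

$$K_1 = (y_i y_j\, x_i^\top x_j)_{i,j = 1 \dots n},$$
$$K_2 = (x_i^\top x_j\, y_i)_{i = 1 \dots n,\ j = n+1 \dots m},$$
$$K_3 = (x_i^\top x_j)_{i,j = n+1 \dots m}.$$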
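The block structure of $G$ can be checked numerically. The sketch below uses one convenient device that is our own and not taken from the text: each dual variable is associated with a signed copy of its input vector ($y_i x_i$ for $\alpha_i$, $-x_i$ for $\mu_i^+$, $+x_i$ for $\mu_i^-$), so that $G$ is the Gram matrix of these signed vectors and $w$ is their $\Gamma$-weighted sum. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def signed_points(X_cls, y, X_reg):
    """One signed vector per dual variable, in the order of Gamma."""
    rows = [[yi * c for c in xi] for xi, yi in zip(X_cls, y)]  # alpha block
    rows += [[-c for c in xi] for xi in X_reg]                 # mu^+ block
    rows += [list(xi) for xi in X_reg]                         # mu^- block
    return rows


def build_G(X_cls, y, X_reg):
    """Gram matrix of the signed vectors; has the block layout of G."""
    V = signed_points(X_cls, y, X_reg)
    return [[dot(u, v) for v in V] for u in V]


def primal_w(gamma, X_cls, y, X_reg):
    """w recovered from the dual vector Gamma (linear case)."""
    V = signed_points(X_cls, y, X_reg)
    return [sum(g * v[k] for g, v in zip(gamma, V)) for k in range(len(V[0]))]


X_cls, y = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]], [1, -1, 1]
X_reg = [[0.2, 0.1], [-0.4, 0.9]]
G = build_G(X_cls, y, X_reg)
gamma = [0.5, 0.1, 0.3, 0.2, 0.0, 0.4, 0.7]
w = primal_w(gamma, X_cls, y, X_reg)
quad = sum(gamma[i] * G[i][j] * gamma[j] for i in range(7) for j in range(7))
print(quad, dot(w, w))  # the two values agree up to rounding
```

With $n = 3$ and $m = 5$ the matrix has size $2m - n = 7$, matching the dimension of $\Gamma$.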



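The dual formulation becomes

$$\min_{\Gamma} \ \frac{1}{2} \Gamma^\top G\, \Gamma - \tilde{e}^\top \Gamma \tag{3}$$

subject to

$$\tilde{y}^\top \Gamma = 0 \quad \text{and} \quad 0 \le \Gamma \le [\,\underbrace{C \dots C}_{n \text{ times}} \ \underbrace{\tilde{C} \dots \tilde{C}}_{(m-n) \text{ times}} \ \underbrace{\tilde{C} \dots \tilde{C}}_{(m-n) \text{ times}}\,]^\top$$

with

$$\tilde{e} = [\,\underbrace{1 \dots 1}_{n \text{ times}} \ \ {-z_{n+1}^+} \dots {-z_m^+} \ \ z_{n+1}^- \dots z_m^-\,]^\top$$

and

$$\tilde{y} = [\,y_1 \dots y_n \ \underbrace{-1 \dots -1}_{(m-n) \text{ times}} \ \underbrace{1 \dots 1}_{(m-n) \text{ times}}\,]^\top.$$

4. KERNELIZATION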



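Formulations (2) and (3) can be easily generalized by introducing kernel functions. Let $k$ be a positive kernel satisfying Mercer's condition and $\mathcal{H}$ the associated Reproducing Kernel Hilbert Space (RKHS). Within this framework, equation (2) becomes

$$\min_{f \in \mathcal{H},\, b,\, \xi,\, \xi^-,\, \xi^+} \ \frac{1}{2}\|f\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i + \tilde{C} \sum_{i=n+1}^{m} (\xi_i^- + \xi_i^+) \tag{4}$$

subject to

$$y_i (f(x_i) + b) \ge 1 - \xi_i, \quad i = 1 \dots n$$
$$z_i^- - \xi_i^- \le f(x_i) + b \le z_i^+ + \xi_i^+, \quad i = n+1 \dots m$$
$$0 \le \xi_i, \quad i = 1 \dots n$$
$$0 \le \xi_i^- \ \text{and} \ 0 \le \xi_i^+, \quad i = n+1 \dots m$$

Formulation (3) remains identical, with

$$K_1 = (y_i y_j\, k(x_i, x_j))_{i,j = 1 \dots n},$$
$$K_2 = (k(x_i, x_j)\, y_i)_{i = 1 \dots n,\ j = n+1 \dots m},$$
$$K_3 = (k(x_i, x_j))_{i,j = n+1 \dots m}.$$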
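Once the dual variables are known, the kernelized decision function and the probability output can be sketched as below. The kernel expansion is our reading of the linear expression of $w$ with inner products replaced by $k$; the function names, the Gaussian kernel default and the bandwidth `sigma` are assumptions made for illustration, and the dual variables in the usage lines are made-up numbers rather than the solution of an actual quadratic program.

```python
import math


def rbf(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)), assumed for illustration."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def decision_function(x, X_cls, y, X_reg, alpha, mu_plus, mu_minus, b, kernel=rbf):
    """f(x) + b, with f expanded on the training points."""
    f = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X_cls))
    f -= sum((mp - mm) * kernel(xi, x)
             for mp, mm, xi in zip(mu_plus, mu_minus, X_reg))
    return f + b


def predict_proba(score, eta):
    """Sigmoid of the score with slope a = ln(1/eta - 1), in a stable form."""
    t = math.log(1.0 / eta - 1.0) * score
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


# Made-up dual variables, only to show the call pattern
s = decision_function([0.1], X_cls=[[0.5], [-0.5]], y=[1, -1], X_reg=[[0.0]],
                      alpha=[1.0, 1.0], mu_plus=[0.2], mu_minus=[0.0], b=0.0)
print(s, predict_proba(s, eta=0.1))
```

No separate calibration step appears here: the probability is read directly from the score through the slope $a$ fixed by $\eta$.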

5. EXAMPLES 



In order to experimentally evaluate the proposed method for handling uncertain labels in SVM classification, we have simulated different data sets described below. In these numerical examples, an RBF kernel $k(u, v) = e^{-\|u - v\|^2 / 2\sigma^2}$ is used and $C = \tilde{C} = 100$. We implemented our method using the SVM-KM Toolbox [5]. We compare the classification performance and probabilistic predictions of the C-SVM and P-SVM approaches. In the first case, probabilities are estimated by using Platt's scaling algorithm [3], while in the second case, probabilities are directly estimated via the formula $P(y = 1 \mid x) = \frac{1}{1 + e^{-a(w^\top x + b)}}$. Performance is evaluated by computing

• Accuracy (Acc)

Proportion of well predicted examples in the test set (for evaluating classification).

• Kullback-Leibler distance (KL)

$$D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} P(y_i = 1 \mid x_i) \log\left( \frac{P(y_i = 1 \mid x_i)}{Q(y_i = 1 \mid x_i)} \right)$$

for probability distributions $P$ and $Q$ (for evaluating probability estimation).
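The two evaluation measures can be sketched as follows. The KL function follows the formula as written above, summing only the class-1 term over the test points. Skipping terms with $P = 0$ and clipping $Q$ away from zero with `eps` are our assumptions to keep the logarithm defined, and the function names are hypothetical.

```python
import math


def accuracy(y_true, y_pred):
    """Proportion of test examples whose predicted class matches the true one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def kl_distance(p_true, q_est, eps=1e-12):
    """Sum over test points of P log(P / Q), class-1 probabilities only."""
    total = 0.0
    for p, q in zip(p_true, q_est):
        if p > 0.0:  # convention: a term with P = 0 contributes nothing
            total += p * math.log(p / max(q, eps))
    return total


print(accuracy([1, -1, 1, 1], [1, 1, 1, -1]))
print(kl_distance([0.5, 0.2], [0.25, 0.2]))
```

Because only the class-1 term is summed, this quantity is not guaranteed to be non-negative; it is zero when the estimates coincide with the true probabilities.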

5.1. Probability estimation 

We generate two unidimensional datasets, labelled '+1' and '-1', from normal distributions of variances $\sigma_{-1}^2 = \sigma_1^2 = 0.3$ and means $\mu_{-1} = -0.5$ and $\mu_1 = +0.5$. Let $(x_i^l)_{i=1 \dots n_l}$ denote the learning data set ($n_l = 200$) and $(x_i^t)_{i=1 \dots n_t}$ the test set ($n_t = 1000$). We compute, for each point $x_i$, its true probability $P(y_i = +1 \mid x_i)$ to belong to class '+1'. From here on, learning data are labelled in two ways, as follows.

a) For $i = 1 \dots n_l$, we get the regular SVM dataset by simply using a probability of 0.5 as the threshold for assigning the class label $y_i^l$ associated with point $x_i^l$. This is what would be done in practical cases when the data contain class membership probabilities and an SVM classifier is used.

$$y_i^l = \begin{cases} 1 & \text{if } P(y_i^l = 1 \mid x_i^l) > 0.5, \\ -1 & \text{if } P(y_i^l = 1 \mid x_i^l) \le 0.5. \end{cases} \tag{5}$$

This dataset $(x_i^l, y_i^l)_{i=1 \dots n_l}$ is used to train the C-SVM classifier.

b) We define another data set $(x_i^l, \hat{y}_i^l)_{i=1 \dots n_l}$ such that, for $i = 1 \dots n_l$,

$$\hat{y}_i^l = \begin{cases} 1 & \text{if } P(y_i^l = 1 \mid x_i^l) > 1 - \eta, \\ -1 & \text{if } P(y_i^l = 1 \mid x_i^l) < \eta, \\ P(y_i^l = 1 \mid x_i^l) & \text{otherwise.} \end{cases} \tag{6}$$

If the probability values are sufficiently close to 0 or 1 (closeness being defined by the precision and confidence), we consider that the corresponding points belong to class -1 or 1, respectively. This probabilistic dataset $(x_i^l, \hat{y}_i^l)_{i=1 \dots n_l}$ is used to train the P-SVM algorithm.
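The two labelling schemes can be sketched on simulated one-dimensional data as follows. Several choices here are assumptions that the text does not state: equal class priors when computing the true posterior, drawing each point from one of the two Gaussians with probability 0.5, the value `eta=0.1`, and the tagged-tuple representation of mixed labels. All function names are hypothetical.

```python
import math
import random


def true_posterior(x, mu_pos=0.5, mu_neg=-0.5, var=0.3):
    """P(y = +1 | x) for two Gaussians of equal variance, equal priors assumed."""
    log_ratio = ((x - mu_neg) ** 2 - (x - mu_pos) ** 2) / (2.0 * var)
    return 1.0 / (1.0 + math.exp(-log_ratio))


def sample_inputs(n, seed=0, mu_pos=0.5, mu_neg=-0.5, var=0.3):
    """Draw n points, each from one of the two Gaussians with probability 0.5."""
    rng = random.Random(seed)
    std = math.sqrt(var)
    return [rng.gauss(mu_pos if rng.random() < 0.5 else mu_neg, std)
            for _ in range(n)]


def threshold_labels(probs):
    """Scheme a): hard labels obtained by thresholding at 0.5."""
    return [1 if p > 0.5 else -1 for p in probs]


def probabilistic_labels(probs, eta):
    """Scheme b): hard label when p is within eta of 0 or 1, else keep p."""
    labels = []
    for p in probs:
        if p > 1.0 - eta:
            labels.append(("class", 1))
        elif p < eta:
            labels.append(("class", -1))
        else:
            labels.append(("proba", p))
    return labels


xs = sample_inputs(200)
probs = [true_posterior(x) for x in xs]
hard = threshold_labels(probs)
mixed = probabilistic_labels(probs, eta=0.1)
print(sum(1 for kind, _ in mixed if kind == "proba"), "points keep a probability")
```

Points tagged `"class"` would feed the hinge-loss constraints and points tagged `"proba"` the interval constraints, which is how the learning set splits into its first $n$ and last $m - n$ examples.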
We compare our two approaches using the test set $(x_i^t)_{i=1 \dots n_t}$. As we know the true probabilities $(P(y_i^t = 1 \mid x_i^t))_{i=1 \dots n_t}$, we can estimate the probability prediction error (KL). Figure 1 shows the improvement in probability prediction performance achieved by the P-SVM: the true probabilities (black) and P-SVM estimations (red) are quasi-superimposed (KL=0.2) whereas Platt's estimations are less accurate (KL=11.3).
Fig. 1: Probability estimations comparison. Top plot shows the true posterior probabilities with C-SVM and P-SVM estimations overlaid. Lower plot shows the distance between true probabilities and estimations.

5.2. Noise robustness 

We generate two 2D datasets, labelled '+1' and '-1', from normal distributions of variances $\sigma_{-1}^2 = \sigma_1^2 = 0.1$ and means $\mu_{-1} = (-0.3, -0.5)$ and $\mu_1 = (+0.3, +0.5)$. As in the previous experiment, we compute the class '1' membership probability for each point $x_i$ of the learning data set. We simulate classification error by artificially adding a centered uniform noise ($\delta$ of amplitude 0.1) to the probabilities, such that for $i = 1 \dots n$,

$$\hat{P}(y_i = 1 \mid x_i) = P(y_i = 1 \mid x_i) + \delta_i.$$
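The noise model can be sketched as below. Two points are our assumptions rather than statements from the text: "amplitude" is read as the half-width of the uniform interval, so that each $\delta_i$ is drawn from $[-0.1, 0.1]$, and the noisy values are clipped to $[0, 1]$ so that they remain valid probabilities. The function name is hypothetical.

```python
import random


def add_label_noise(probs, amplitude=0.1, seed=0, clip=True):
    """Add centered uniform noise to each probability.

    Assumptions: amplitude is the half-width of the noise interval, and
    noisy values are clipped to [0, 1] when clip is True.
    """
    rng = random.Random(seed)
    noisy = []
    for p in probs:
        q = p + rng.uniform(-amplitude, amplitude)
        if clip:
            q = min(1.0, max(0.0, q))
        noisy.append(q)
    return noisy


print(add_label_noise([0.02, 0.5, 0.98]))
```

Near 0 and 1 the clipping makes the effective noise asymmetric, so a point with true probability above $1 - \eta$ can fall back into the probabilistic regime after perturbation.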

We then label the learning data following the same scheme as described in (5) and (6). Figure 2 shows the margin location and probability estimations using the two methods over a grid of values. Far from the learning data points, both probability estimations are less accurate, which is directly linked to the choice of a Gaussian kernel. However, the P-SVM classification and probability estimations obtained for 1000 test points are clearly closer to the ground truth ($\mathrm{Acc}_{\text{P-SVM}} = 99\%$, $\mathrm{KL}_{\text{P-SVM}} = 3.6$) than those of C-SVM ($\mathrm{Acc}_{\text{C-SVM}} = 95\%$, $\mathrm{KL}_{\text{C-SVM}} = 95$). Contrary to P-SVM which, by combining both classification and regression, predicts good probabilities, C-SVM is sensitive to classification noise and no longer converges to the Bayes rule, as seen in [1].





Fig. 2: Probability estimations of C-SVM and P-SVM over a grid using noisy learning data (uniform noise, amplitude 0.1). Noisy learning data are plotted as blue (class '-1') and red (class '1') stars.

Figure 3 shows the impact of noise amplitude on classifier performance (values are averaged over 30 random simulations). Even as noise increases, the classification and probability prediction performance of the P-SVM remains significantly higher than that of C-SVM.

Fig. 3: Noise impact on P-SVM and C-SVM classification performances

6. CONCLUSION

This paper has presented a new way to take into account both qualitative and quantitative target data by shrewdly combining both SVM classification and regression loss. Experimental results show that our formulation can perform very well on simulated data for discrimination as well as posterior probability estimation. This approach will soon be applied to clinical data, thus allowing us to assess its usefulness in computer-assisted diagnosis for prostate cancer. Note that this framework, initially designed for probabilistic labels, can also be generalized to other datasets involving quantitative data, as it can be used for instance to estimate a conditional cumulative distribution function.